Video captioning has been attracting broad research attention in the multimedia community. However, most existing approaches either ignore temporal information among video frames or employ only local contextual temporal knowledge. In this work, we propose a novel video captioning framework, termed \emph{Bidirectional Long-Short Term Memory} (BiLSTM), which deeply captures bidirectional global temporal structure in video. Specifically, we first devise a joint visual modelling approach to encode video data by combining a forward LSTM pass, a backward LSTM pass, and visual features from Convolutional Neural Networks (CNNs). Then, we inject the derived video representation into the subsequent language model for initialization. The benefits are twofold: 1) comprehensively preserving sequential and visual information; and 2) adaptively learning dense visual features and sparse semantic representations for videos and sentences, respectively. We verify the effectiveness of our proposed video captioning framework on a commonly-used benchmark, i.e., the Microsoft Video Description (MSVD) corpus, and the experimental results demonstrate the superiority of the proposed approach compared to several state-of-the-art methods.
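To make the encoder design concrete, the following is a minimal PyTorch sketch of a bidirectional LSTM video encoder that fuses a forward pass, a backward pass, and per-frame CNN features. This is not the authors' implementation; the module names, feature dimensions, and the concatenation-plus-projection fusion strategy are all assumptions made for illustration.

\begin{verbatim}
import torch
import torch.nn as nn

class BiLSTMVideoEncoder(nn.Module):
    """Sketch of a bidirectional LSTM video encoder that fuses
    forward and backward temporal passes with CNN frame features."""

    def __init__(self, cnn_feat_dim=2048, hidden_dim=512):
        super().__init__()
        # Forward and backward LSTMs over the CNN frame-feature sequence.
        self.forward_lstm = nn.LSTM(cnn_feat_dim, hidden_dim, batch_first=True)
        self.backward_lstm = nn.LSTM(cnn_feat_dim, hidden_dim, batch_first=True)
        # Project both passes plus mean-pooled CNN features into one
        # video representation (this fusion scheme is an assumption).
        self.fusion = nn.Linear(2 * hidden_dim + cnn_feat_dim, hidden_dim)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, cnn_feat_dim) from a pretrained CNN.
        fwd_out, _ = self.forward_lstm(frame_feats)
        # Reverse the frame order for the backward pass.
        bwd_out, _ = self.backward_lstm(torch.flip(frame_feats, dims=[1]))
        # Fuse the final hidden state of each pass with pooled visual features.
        fused = torch.cat(
            [fwd_out[:, -1], bwd_out[:, -1], frame_feats.mean(dim=1)], dim=-1
        )
        # The result can initialize the hidden state of the caption decoder.
        return torch.tanh(self.fusion(fused))

# Usage: encode 26 frames of 2048-d CNN features for a batch of 4 videos.
encoder = BiLSTMVideoEncoder()
video_repr = encoder(torch.randn(4, 26, 2048))
print(video_repr.shape)  # torch.Size([4, 512])
\end{verbatim}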